Abstract

This study and discussion center upon the use of YouTube’s automatic captioning feature with 
college-age adult readers. The study required 75 participants with college experience to view 
brief middle school science videos with automatic captioning on YouTube and answer 
comprehension questions based on material presented auditorily and/or through the automatic 
captions. Participants were divided into groups and presented with the captioned videos with or 
without sound. The videos, which all focused on the solar system, contained low and high 
instances of errors within the captions. The research found that comprehension of the automatic 
caption text varied significantly based on how the participants viewed the videos, with 
significantly more errors in comprehension for the group that viewed the high error video with 
automatic captioning only. 

Introduction

Since Google’s acquisition of YouTube in 2006, both web giants have been working on 
developing a captioning method to make web-based video content accessible for deaf and hard of 
hearing users. YouTube currently offers users who post videos the option, which YouTube 
strongly encourages, to add subtitles and captions to their video 
(https://support.google.com/youtube/answer/27347967rdM; 3 Play Media, 2014). Also available 
to users is an automatic captioning feature. The automatic captioning feature is based on 
speech-recognition technology that employs a complex statistical model for the probability of 
specific sounds, words, and word combinations occurring within a language. According to Google’s 
YouTube Help site, automatic captioning is available in 10 languages worldwide (Google, 2015). 

While attending class and preparing university courses that include students who are 
deaf and/or hard of hearing, the authors noted the volume of errors in the automatic captioning 
present in several of the videos posted and viewed in YouTube. After some discussion, the 
authors decided to conduct a preliminary investigation into the overall effectiveness of the 
automatic captioning tool. To determine the consistency of YouTube’s automatic captioning 
feature of online videos, 50 videos targeted at a middle-school audience were viewed. In each of 
the videos, a variety of errors and error types were documented. Though some error types were 
more prominent than others, each error type plays a role in the overall comprehension of the 
video content. Errors were divided into 11 categories during the review and included addition of 
words, deletion of words, coherent miscues, incoherent miscues, spelling, incomprehensible 
phrases, word condensing, homonyms, approximation (content errors), speed, and visual 
readability. Videos with audio of 1) non-native English speakers with accents, 2) young 
children’s voices, or 3) voices that contained mumbling or computer-generated/mechanical sounds 
were harder to understand and had more captioning errors. The speed of the videos and timing of 
the captioned content also proved problematic. Readability of the captioned content and 
aesthetics of the captioning were also noted for each video. 

The results of the initial project were gathered by watching the first two minutes of the 
50 videos focusing on the solar system with the automatic captioning feature enabled and with 
content from the 8th grade Texas Essential Knowledge and Skills (TEKS) goals. The frequency 
and type of errors were documented along with the quality, speed, and type of video. The data 
were then put into an Excel spreadsheet in preparation for the data analysis. The errors that were 
documented from each phrase were categorized as follows: additions, deletions, coherent 
miscues (full-phrase), incoherent miscues (full-phrase), miscues of a single word, word 
condensing, homonyms, approximations, and morphemes. These categories were established to 
set up guidelines for what was to be considered an error. This initial evaluation of YouTube’s 
content prompted the current study. 
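
The tallying of errors by category described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' spreadsheet: the category names follow the list in the text, while the function name, the video identifiers, and the sample observations are hypothetical placeholders.

```python
from collections import Counter

# Error categories as listed in the text for the phrase-level review.
CATEGORIES = [
    "addition", "deletion", "coherent miscue", "incoherent miscue",
    "single-word miscue", "word condensing", "homonym",
    "approximation", "morpheme",
]

def tally_errors(observations):
    """Count documented errors per video and per category.

    observations: iterable of (video_id, category) pairs, one pair per
    documented error. Returns {video_id: Counter({category: count})}.
    """
    tallies = {}
    for video_id, category in observations:
        if category not in CATEGORIES:
            raise ValueError(f"unknown error category: {category}")
        tallies.setdefault(video_id, Counter())[category] += 1
    return tallies

# Hypothetical sample data, for illustration only.
sample = [
    ("video_A", "homonym"),
    ("video_A", "deletion"),
    ("video_A", "homonym"),
    ("video_B", "addition"),
]
tallies = tally_errors(sample)
# Total number of errors per video, summed over all categories.
totals = {video_id: sum(counts.values()) for video_id, counts in tallies.items()}
```

Rejecting unknown category labels keeps the coding scheme fixed, which matches the stated purpose of the categories: to set guidelines for what counts as an error.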

Review of Literature 

Effects of Captioning 

The discussion of captioning audio-visual material must focus on more than simply 
attempting to present textual representations of audio content on the video. Simply putting text 
on the screen is insufficient for providing equitable access to the audio content. Successful 
captioning has been an appropriate supplement to video-based materials and has even been 
shown to be useful as a foreign language instructional tool when used with videos containing 
native speaker accents (Dabhi, 2004). Captioning in different formats, including keyword 
captioning, in which students view the video with partial captioning using only pre-selected 
keywords while listening to a video at the same time, has proven effective with users of video- 
based content, especially when the complexity of the video content is beyond the reading level of 
the viewer (Ruan, 2015). In such instances, the captioned content can help clarify the viewer’s 
understanding of the video content presented. Lewis & Jackson (2001) found that the script 
comprehension for captioned videos for students who were deaf or hard of hearing was greater 
than the comprehension of script in other text forms and increased comprehension beyond the 
identified reading levels of students. Verbatim captions that are paced to the natural rate of 
delivery provide access to complete conversational exchanges including both the audio and 
visual information and allow viewers to comprehend both explicit and implicit information. 


Online Learning - Volume 21 Issue 1 - March 2017 





There is an advantage to both deaf and hearing students in terms of comprehension when 
video and captions are presented (Lewis & Jackson, 2001). Advantages of captioned video 
include facilitating novel vocabulary identification and overall comprehension (Winke, Gass, & 
Sydorenko, 2010). For second language learners, captioning aids in form-mapping, the process 
of connecting spoken and written vocabulary: learners do not have to focus auditorily on word 
meaning and can instead focus on the printed form to connect it with meaning (Winke, 
Gass, & Sydorenko, 2010). Information presented verbally and visually is integrated as it 
is stored in memory (Sadoski & Paivio, 2004). Johnson-Glenberg (2000) reports that the recall of 
the linguistic information will stimulate retrieval of the visual information and vice versa. 
However, even when captioned material contains enhanced or expanded content, it often goes 
unused by educators despite feedback from students indicating that using captioning while 
viewing video content would be appealing (Steinson & Stevenson, 2015). 

Successful Caption Use 

There are a number of issues involved with successful captioning of audio-video 
content. Two issues affecting the overall captioned experience include speed and formatting. 
Jensema & McCann (1995) found that the “safe speed” for word content displayed in captioned 
material was approximately 120-140 words per minute. Unfortunately, captioned material can be 
presented at speeds exceeding 200 words per minute. 
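
The presentation rate of a caption can be estimated from its text and its on-screen duration. The sketch below is illustrative only: the function names and the sample caption are hypothetical, and the 140 words-per-minute threshold is simply the upper end of the "safe speed" range reported by Jensema & McCann (1995).

```python
def words_per_minute(caption_text, start_seconds, end_seconds):
    """Estimate the presentation rate of one caption in words per minute."""
    duration = end_seconds - start_seconds
    if duration <= 0:
        raise ValueError("caption must have a positive duration")
    word_count = len(caption_text.split())
    return word_count * 60.0 / duration

def exceeds_safe_speed(wpm, upper_limit=140):
    """Return True when the rate is above the upper end of the safe range."""
    return wpm > upper_limit

# Hypothetical caption: six words displayed for three seconds.
rate = words_per_minute("the sun formed from a cloud", 0.0, 3.0)
too_fast = exceeds_safe_speed(rate)
```

Counting whitespace-separated tokens is a simplification; a real caption file would also need its timing format parsed before this calculation could be applied.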

Formatting can make the message delivery problematic as well. Closed-captioning 
versus automatic captioning and placement of the text on the video content can also play 
significant roles in how the captioned content is understood. Closed-captioning involves 
embedded textual content (by an author/programmer) that is timed to present with the audio 
content synchronously. Closed captioning can be presented live, as with television programming, 
or post-production, as with cinema movie content, and is not visible until the user activates their 
decoding systems and displays the captions on their screens. New television sets and video 
display monitors sold in the United States must be equipped with built-in caption decoder chips. 
Schools, colleges, libraries and other recipients of federal financial assistance are required under 
section 508 of the Rehabilitation Act to make their communication accessible to and usable by 
persons with disabilities (National Association of the Deaf, 2002). Automatic captioning 
involves the presentation of textual material based solely on the success of speech-recognition 
software. The captioned content for automatic captioning is not a permanent component of the 
video, and often varies in accuracy based on the speech delivery of the audio content. 

Despite the numbers of individuals with disabilities using social media options and 
information and communication technologies (ICTs) today, many individuals still struggle with 
issues related to accessibility (Seale, Georgeson, Mamas, & Swaim, 2015; Asuncion, Budd, 
Fichten, Nguyen, Barile, & Amsel, 2012; Fichten, Asuncion, & Scapin, 2014). In response to the 
growing population of people with hearing loss who use Internet-based technologies and social media, 
YouTube has created an automatic captioning feature. The feature attempts to approximate text- 
based representations of the speech audio content within videos through the use of speech 
recognition software. Sadly, researchers and companies, including Google, have recognized that 
despite the vast resources of Google and YouTube, the automatic captioning can fail to accurately 
convey the intended message (Barton, Bradbrook, & Broome, 2015; Johnson, 2014). 

No current research focusing on the success or comprehension of material with the 
automatic captioning feature of YouTube was found. This mirrors historical trends regarding the 
use of captioning. In fact, according to Cambra, Silvestre, and Leal (2009), the study of deaf 
individuals’ use of closed-captioning on television has not been a research priority in the field of 
deaf education. In a review of the literature of closed-captioning on television, no research on 
the influence of errors within the captioning was located. As captioning for online content is 
relatively new, the issue of Web-based captioning has not become a priority in deaf education 
either. 


Successful use of captioned video material requires a significant cognitive load visually 
even when the captioned material is presented appropriately. Cognitive load can be explained as 
a complex theory that attempts to quantify the burden that performing a specific task imposes on 
the cognitive system of a learner (Paas & van Merrienboer, 1994; Paas, Tuovinen, Tabbers, & 
Van Gerven, 2003). In relation to this study, cognitive load occurs when a viewer relying on 
captioned material to access the content presented auditorily must attempt to access both the 
captioned content and the video-based content simultaneously. The brain is taxed visually and 
the two modalities compete for delivering the material to the brain via the same cognitive space. 
That cognitive competition can add stress to the situation making comprehension more difficult, 
especially for those with limited reading proficiency. 

The primary focus for this study is based in part on the assertion by Cambra, Silvestre, 
and Leal (2009) that reading comprehension and reading speed influence comprehension of 
closed-captioned text of video content. Because of the complexity of successfully navigating 
captioned video material, comprehension of such material requires individuals with successful 
reading skills. The proposed research seeks to evaluate the comprehensibility of YouTube’s 
automatic captioning feature based on the reading abilities of college-educated adult readers. The 
research question for this study is, “Can college-level adult readers understand basic concepts 
from science videos posted on YouTube with auto-captioned text containing errors?” 

Methodology 

Participants 

Participation in the study was open to students, staff, and faculty at a doctoral-granting, 
public university in the southwestern US (Table 1). To locate participants with successful 
reading skills, participants had to have had some college experience in order to participate in the 
study. The researchers anticipated that participant ages would vary and range from 18 to 60+ 
years. All genders of adult students, staff, and faculty at the university were allowed to 
participate. Study participation was dependent on reading ability. 

The study sought to determine whether individuals with college-level reading abilities 
could understand the printed messages of content delivered via the YouTube automatic 
captioning feature. Participants were required to have the ability to read Web-based automatic 
captioning and simple sentences and questions, as well as write their own answers to the 
questions about the videos. 

Participants for the research study were from the university community where the research 
was conducted. Researchers emailed an invitation to university students, faculty, and staff on the 
primary campus of the university regarding participation in the study. Participants were invited 
to participate in the study using one of the on-campus computer classrooms. 
Participants completed a consent form before participation was permitted. 

The research team hosted and monitored the study participants in a university computer 
classroom/lab using an Internet-connected computer and paper questionnaire. Demographic 
information was collected at the beginning of the classroom portion of the research (Table 1). 
The vast majority of participants were female (94%), with English as a first language (81.2%), 
and were hearing (95%). Only three individuals self-identified as deaf or hard of hearing. 
Participants watched one of three middle-school videos about the solar system and answered 
basic questions about the material presented via one of the three viewing options. Using a pen 
and a paper questionnaire, participants answered a series of questions for one of three video 
options on the solar system. The video selection, caption availability, and sound availability were 
chosen at random for each participant and the sound was turned off or on accordingly. The 
research occurred over a two-week period. Participants were given the option of selecting a date 
they wished to participate. Participants did not know which video or under which viewing 
conditions they would be watching the video until they sat down at the computer. Each of the 
computer stations was randomly set to view one of the videos under a specific viewing option. 
The questionnaires were numbered to indicate the video and viewing condition, and the 
computers were set up to view the corresponding video under that viewing condition. 
Students self-selected where they sat upon entering the classroom. 
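
One way to set stations to viewing conditions at random while keeping group sizes roughly equal is a balanced shuffle. The sketch below is an assumption about how such an assignment could be done, not a record of the authors' procedure; the group numbers follow Appendix A, and the function name, seed, and station count are hypothetical.

```python
import random

# Group numbers and conditions used in the study (see Appendix A).
CONDITIONS = {
    1: ("low error", "sound no caption"),
    2: ("low error", "sound and caption"),
    3: ("low error", "caption no sound"),
    7: ("high error", "sound no caption"),
    8: ("high error", "sound and caption"),
    9: ("high error", "caption no sound"),
}

def assign_stations(n_stations, seed=None):
    """Randomly map station numbers (1..n_stations) to group numbers.

    Groups are repeated in turn before shuffling, so group sizes differ
    by at most one station.
    """
    rng = random.Random(seed)
    group_ids = sorted(CONDITIONS)
    pool = [group_ids[i % len(group_ids)] for i in range(n_stations)]
    rng.shuffle(pool)
    return {station: group for station, group in enumerate(pool, start=1)}

# Hypothetical lab with 12 stations; the seed only makes the run repeatable.
assignment = assign_stations(12, seed=2017)
```

Because participants chose their own seats, randomizing the stations rather than the people still leaves each participant's condition unknown to them until they sit down.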

Originally, participants were to view one of three middle school science videos on the 
solar system labeled as having captions with “few” errors, “moderate” errors, or “high” errors. 
The error types used in determining the groupings of “few-error, moderate-error, and high-error” 
status were identical to those used at the initial stages of the project to determine the types of 
errors present in the automatic captioning. Videos were selected from the initial evaluation of 50 
videos. Videos were categorized into one of the three groups, based on the number of errors 
present. To select the videos for the study, the authors used the videos with the highest, median, 
and lowest numbers of errors. Videos were then re-evaluated to determine which in each group 
contained standard American English speech from a human (not digitized or robotic speech) with 
limited to no accents so that the only influence on the automatic captioning was the quality of the 
speech recognition software. Due to limited participation, the authors focused the initial viewing 
sessions on the videos with “few” and “high” errors. No participants viewed the videos with 
“moderate” instances of errors. Participants for each session viewed the videos similarly under 
one of the following conditions: 1) sound without captioning, 2) sound with automatic 
captioning, or 3) automatic captioning without sound (Appendix A). Which participant sessions 
were given a particular viewing method was determined by random 
selection. Participants who seated themselves at computers where the sound was enabled for the 
video viewing were provided a new pair of earphones to use throughout the experience. 
Participants were free to take the earphones with them once their participation was complete. 
Sessions were arranged so that approximately the same number of participants viewed each 
video option. 
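
The selection of the videos with the lowest, median, and highest error counts can be sketched as follows. The function name and the sample counts are hypothetical; with an even number of videos (such as the 50 reviewed), this sketch takes the lower of the two middle videos, which is an assumption rather than a documented choice.

```python
def pick_low_median_high(error_counts):
    """Return the video IDs with the lowest, median, and highest error counts.

    error_counts: dict mapping video_id -> total number of caption errors.
    For an even number of videos the lower of the two middle videos is used.
    """
    if len(error_counts) < 3:
        raise ValueError("need at least three videos to pick three groups")
    ranked = sorted(error_counts.items(), key=lambda item: item[1])
    low = ranked[0][0]
    median = ranked[(len(ranked) - 1) // 2][0]
    high = ranked[-1][0]
    return low, median, high

# Hypothetical error totals for five videos.
counts = {"a": 5, "b": 40, "c": 12, "d": 3, "e": 27}
low_video, median_video, high_video = pick_low_median_high(counts)
```

The second screening step described in the text (standard American English, human speech, limited accent) would be applied to the candidates afterward and is not modeled here.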

Each of the YouTube videos was watched online during each viewing. The Web 
addresses for the videos are listed at the end of this document and in the attached video 
questionnaires (Appendix B). Participants viewed the pre-determined video on a university 
computer. Participants were free to watch the video and answer the questions as they wished. 
They could watch the video entirely and then go back and answer questions or they could answer 
questions while watching the video. While participants were only allowed to participate one time 
in the study, there were no limitations on the number of times they could watch the video or 
pause and go back during that participation. The researchers had the YouTube site open to the 
appropriate video upon participants’ arrival to the computer lab. 

Results 

Participants 

Frequencies and percentages for the demographic variables are displayed in Table 1. The 
largest percentage of participants were in group 2 (20.3%) and the majority of participants were 
in low error groups (52.7%). In addition, the largest percentage of participants were in the sound 
and caption group (35.1%). Finally, the majority of participants were female (94.0%), reported 
that English was their first language (81.2%), and had identified themselves as hearing (95.1%). 
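
The percentages reported here are shares of the participants who answered each item, so the denominator differs by variable (74 for group membership, 67 for gender). A small sketch makes the arithmetic explicit; the function name is hypothetical, and the counts are those listed in Table 1.

```python
def valid_percentages(counts, decimals=1):
    """Convert category counts to percentages of the responses received."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, decimals) for label, n in counts.items()}

# Counts taken from Table 1.
group_counts = {"Group 1": 11, "Group 2": 15, "Group 3": 13,
                "Group 7": 13, "Group 8": 11, "Group 9": 11}
gender_counts = {"Female": 63, "Male": 3, "Transgendered": 1}

group_pct = valid_percentages(group_counts)
gender_pct = valid_percentages(gender_counts)
```

With these counts, Group 2 works out to 20.3% of 74 participants and female participants to 94.0% of the 67 who reported gender, in line with the figures in the text.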

Means and standard deviations for the continuous variables are displayed in Table 2. 

Participants’ ages ranged from 18 to 59 (M = 22.55, SD = 6.14) and the number of years 
of college experience ranged from 0 to 11 (M = 3.21, SD = 1.83). The number of correct 
responses by participants ranged from 1 to 12 (M = 8.76, SD = 2.88). 


Table 1. Frequencies and Percentages of Categorical Demographic Variables 

                                               n       % 

Total Groups 
  Group 1 (Low Error, Sound No Caption)       11    14.9 
  Group 2 (Low Error, Sound and Caption)      15    20.3 
  Group 3 (Low Error, Caption No Sound)       13    17.6 
  Group 7 (High Error, Sound No Caption)      13    17.6 
  Group 8 (High Error, Sound and Caption)     11    14.9 
  Group 9 (High Error, Caption No Sound)      11    14.9 

Error Groups 
  Low Error                                   39    52.7 
  High Error                                  35    47.3 

Caption Group 
  Sound No Caption                            24    32.4 
  Sound and Caption                           26    35.1 
  Caption No Sound                            24    32.4 

Gender 
  Female                                      63    94.0 
  Male                                         3    4.50 
  Transgendered                                1    1.50 

English as First Language 
  No                                          13    18.8 
  Yes                                         56    81.2 

Hearing Status 
  Deaf                                         1     1.6 
  Hard of Hearing                              2     3.3 
  Hearing                                     58    95.1 

Table 2. Means and Standard Deviations of Continuous Variables 

                               n       M      SD     Min     Max 

Age                           69   22.55    6.14   18.00   59.00 
College Experience (Years)    69    3.21    1.83    0.00   11.00 
Number Correct                72    8.76    2.88    1.00   12.00 


Table 3. Means and Standard Deviations for the Number of Correct Responses by Gender and Group 

                       Low Error           High Error          Total 
                       n    Mean           n    Mean           n    Mean 

Sound No Caption       11   10.45 a        13   8.85 b,y       24   9.58 
                            (1.13)              (1.68)              (1.64) 
Sound and Caption      15   10.07          11   9.00 y         26   9.62 
                            (1.10)              (2.90)              (2.08) 
Caption No Sound       13   9.69 a          9   2.78 b,x       22   6.86 
                            (21.45)             (1.72)              (3.80) 
Total                  39   10.05          33   7.24           72   8.76 
                            (1.23)              (3.48)              (2.88) 

Note. Standard deviations are shown in parentheses below means. a, b: means in the same row with 
different superscripts differed significantly; x, y: means in the same column with different 
superscripts differed significantly. 
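
The marginal means in Table 3 are sample-size-weighted averages of the cell means. The short check below recomputes them from the cell values reported in the table; the variable and function names are hypothetical.

```python
# (n, mean) for each cell of Table 3, keyed by (caption group, error group).
cells = {
    ("Sound No Caption", "Low Error"): (11, 10.45),
    ("Sound and Caption", "Low Error"): (15, 10.07),
    ("Caption No Sound", "Low Error"): (13, 9.69),
    ("Sound No Caption", "High Error"): (13, 8.85),
    ("Sound and Caption", "High Error"): (11, 9.00),
    ("Caption No Sound", "High Error"): (9, 2.78),
}

def weighted_mean(pairs):
    """Average of cell means weighted by the number of participants per cell."""
    pairs = list(pairs)
    total_n = sum(n for n, _ in pairs)
    return sum(n * mean for n, mean in pairs) / total_n

low_error_mean = weighted_mean(v for (_, err), v in cells.items() if err == "Low Error")
high_error_mean = weighted_mean(v for (_, err), v in cells.items() if err == "High Error")
caption_only_mean = weighted_mean(v for (cap, _), v in cells.items() if cap == "Caption No Sound")
grand_mean = weighted_mean(cells.values())
```

Rounded to two decimals, these reproduce the reported totals of 10.05, 7.24, 6.86, and 8.76.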


Data Analysis 

A 2 (error: low vs. high) x 3 (caption: sound no caption vs. sound and caption vs. caption 
no sound) two-way analysis of variance (ANOVA) was conducted to examine the effect of error 
group and caption group on the number of correct responses. 

There was a statistically significant interaction between error group and caption group, 
F(2, 66) = 19.69, p < .001, partial eta squared = .374. 
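
Partial eta squared can be recovered from an F statistic and its degrees of freedom, because the effect and error sums of squares stand in the ratio F × df(effect) : df(error). The sketch below applies that standard identity to the interaction reported above; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom.

    Uses the identity SS_effect / (SS_effect + SS_error)
    = (F * df_effect) / (F * df_effect + df_error).
    """
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)

# Interaction between error group and caption group: F(2, 66) = 19.69.
interaction_effect = partial_eta_squared(19.69, 2, 66)
```

The result rounds to .374, matching the reported effect size: roughly 37% of the variance not attributable to the main effects is associated with the interaction.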

Simple main effects analyses revealed that participants in the low error group who 
watched the video with caption but no sound answered a significantly greater number of 
questions correctly (M= 9.69, SD = 1.44) than participants in the high error group who watched 
the video with caption but no sound (M = 2.78, SD = 1.72), p < .001 (Figure 1). In addition, 
participants in the low error group who watched the video with sound but no caption answered 
a significantly greater number of questions correctly (M= 10.45, SD = 1.13) than participants in 
the high error group who watched the video with sound but no caption (M= 8.85, SD = 1.68). 

Simple main effects analyses also revealed that participants in the high error group who 
watched the video with caption but no sound answered significantly fewer questions 
correctly (M = 2.78, SD = 1.72) than participants in the high error group who watched the video 
with sound and caption (M = 9.00, SD = 2.90) and participants in the high error group who 
watched the video with sound but no captions (M = 8.85, SD = 1.68) (Figure 2). 

Figure 1: Comparison of comprehension errors between high and low captioning error groups 
based on viewing options (Sound No Caption, Sound and Caption, Caption No Sound; series: 
High Error, Low Error) 

Figure 2: Comparison of comprehension for low error frequency versus high error frequency based on 
viewing option 


For the low error group, the three caption groups did not significantly differ in total 
correct scores. For the high error group, the caption no sound group had significantly fewer 
correct scores than the sound no caption and the sound and caption groups (Figures 1 & 2). 

Limitations and Future Directions 

This study was foundational research into the effectiveness of web-based automatic 
captioning for successful adult readers. Limitations to the study included a predominantly female 
participant group, stemming primarily from the make-up of the university’s student body. 
Another limitation of the study was the use of successful readers. Future research should include 
the use of automatic captioning with struggling readers such as school-age deaf and hard of 
hearing students who demonstrate a variety of reading levels and who would typically be 
exposed to educational content similar to the content presented in this study. The extent to which 
such readers struggle with problematic automatic captioning should also be evaluated. While this 
research explored the abilities of individuals with native English reading experiences, future 
research should also include participants with limited English abilities to determine whether even 
accurate automatic captioning is problematic and the extent to which it is troublesome. 


Discussion 

There are several points of discussion based on the results of the study. Overall, the 
results indicated that the number of questions participants were able to answer correctly did not 
depend only on the type of error or caption group in which they were placed; they also depended 
on the interaction between the two types of groups. Participants in the high error group who 
watched a video with caption but no sound answered significantly fewer questions 
correctly than participants in the high error group who watched a video with sound but no 
caption or both sound and caption. 

The results clearly demonstrate that when auto-captions are presented accurately, 
regardless of whether sound is present, typical adult readers are able to comprehend the 
messages being delivered via text. Conversely, as assumed, when automatic-captions are 
presented inaccurately, containing significant numbers of errors, and when no audio content is 
available, even hearing, college-educated adult readers are unable to comprehend the messages 
being delivered. These results are significant in that participants in this study all have reading 
experience at the postsecondary level and were unable to accurately perceive automatic- 
captioned messages with high errors that were delivered without sound. If adult readers with the 
ability to interact with print at the college level are unable to successfully navigate such 
automatic-captioned content, expecting school-age students who are deaf or hard of hearing with 
varied reading abilities to perform any better would be inappropriate as caption comprehension is 
highly and positively correlated with grade level (Lewis & Jackson, 2001). 

Captioning is critical for video to be accessible to students who are deaf or hard of 
hearing. However, the focus must be more than simply putting text on a screen, as appropriate 
captioning is imperative for successful comprehension. YouTube Teacher was created to help K- 
12 teachers use educational video in their classrooms to support learning and engage and inspire 
learners. YouTube for Schools allows schools that opt-in to access thousands of educational 
videos (Buzzetto-More, 2014). Clearly the use of multimedia and the presentation of verbal and 
visual information will continue to be a common and recommended model of instruction in 
classrooms. Teachers need to become adept at using captioning accurately. As indicated earlier, 
teachers can manually caption their videos using several of the tools available in YouTube’s 
video manager. 

Accurate captioning is equally important for access at the university level. Faculty are 
using and recommending the use of online videos such as those found on YouTube (Moran, 
Seaman, & Tinti-Kane, 2011; Tan & Pearce, 2011). Betts, Cohen, Veit, Alphin, Broadus and 
Allen (2014) identified inaccessibility to videos and voice-over PowerPoint Presentations 
because they do not have captions as one of the greatest challenges for an online student with a 
hearing loss. Such recommendations carry the weight of assuring that videos for courses are 
accessible to all students. Relying on the automatic captioning feature of YouTube will be 
insufficient to provide access for student users who require the captioned text for comprehension. 

Captioning is vital but needs to be studied. We are not suggesting getting rid of it, only 
finding out what works best. Further discussion regarding the automatic captioning of web-based 
video content should center on several issues, including mandating which web-based content 
should be permitted to employ automatic captioning features and how to improve upon the 
infrastructure of automatic captioning and speech recognition platforms. 

APPENDIX A: VIDEOS & VIEWING CONDITIONS 

Group #   Description 

VIDEO ONE LOCATION: http://www.YouTube.com/watch?v=BlAXbpYndGc 

1         Video 1 (low error) sound no caption 
2         Video 1 (low error) sound and caption 
3         Video 1 (low error) caption no sound 

* VIDEO TWO LOCATION: http://www.YouTube.com/watch?v=RJOJCg3S7xO 

4         Video 2 (medium error) sound no caption 
5         Video 2 (medium error) sound and caption 
6         Video 2 (medium error) caption no sound 

VIDEO THREE LOCATION: http://www.YouTube.com/watch?v=tDnawSi64jq 

7         Video 3 (high error) sound no caption 
8         Video 3 (high error) sound and caption 
9         Video 3 (high error) caption no sound 

*No participants viewed this group of videos. Originally, the medium-error videos were to be viewed. 
When it became apparent that the number of participants was going to be limited, the authors chose to 
focus attention on the high-error and low-error videos. 

APPENDIX B: VIDEO VIEWING QUESTIONNAIRE EXAMPLE 
Video Questionnaire: GROUP 1 

“Naked Science: Birth of the Solar System” 
http://www.youtube.com/watch?v=BlAXbpYndGc 


1. A dense clump of water formed what? 

2. When a star reaches 18 million degrees Fahrenheit what kicks in? 

3. When was our star (the sun) born? 

4. What fuses together to form helium? 

5. What is the first type of light made by our sun? 

6. Was the solar system’s birth peaceful? 

7. Where was the sun born? 

8. An entire universe was supposed to be created from what? 

9. A big explosion, that caused the creation of the universe, is known as? 

10. There are how many naturally occurring chemical elements? 

11. What are two elements that planets are made of? 

12. Hydrogen and Helium fuse to make what? 

YouTube Automatic captioning    Participant # 

Gender: M F T    Is English your first language? Y N    Age: 

# of years college experience?    Hearing status: Hearing Deaf Hard of Hearing 